tutorial_clustering (Score: 36.5 / 39.0)

  1. Test cell (Score: 1.0 / 1.0)
  2. Test cell (Score: 1.0 / 1.0)
  3. Test cell (Score: 1.0 / 1.0)
  4. Test cell (Score: 1.0 / 1.0)
  5. Written response (Score: 1.5 / 3.0)
  6. Comment
  7. Test cell (Score: 1.0 / 1.0)
  8. Test cell (Score: 1.0 / 1.0)
  9. Written response (Score: 3.0 / 3.0)
  10. Test cell (Score: 1.0 / 1.0)
  11. Test cell (Score: 1.0 / 1.0)
  12. Written response (Score: 3.0 / 3.0)
  13. Coding free-response (Score: 2.0 / 3.0)
  14. Comment
  15. Test cell (Score: 0.0 / 0.0)
  16. Written response (Score: 3.0 / 3.0)
  17. Test cell (Score: 3.0 / 3.0)
  18. Test cell (Score: 3.0 / 3.0)
  19. Written response (Score: 3.0 / 3.0)
  20. Coding free-response (Score: 3.0 / 3.0)
  21. Test cell (Score: 1.0 / 1.0)
  22. Written response (Score: 3.0 / 3.0)

Tutorial 10 - Clustering

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

  • Describe a case where clustering would be an appropriate tool, and what insight it would bring about the data.
  • Explain the K-means clustering algorithm.
  • Interpret the output of a K-means cluster analysis.
  • Perform K-means clustering in R.
  • Visualize the output of K-means clustering in R using a coloured scatter plot.
  • Identify when it is necessary to scale variables before clustering, and do this using R.
  • Use the elbow method to choose the number of clusters for K-means.
  • Describe the advantages, limitations, and assumptions of the K-means clustering algorithm.

This worksheet covers parts of the Clustering chapter of the online textbook. You should read this chapter before attempting the worksheet.

In [1]:
### Run this cell before continuing.
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(repr)
library(GGally)
options(repr.matrix.max.rows = 6)
source('tests.R')
source("cleanup.R")
Warning message:
“package ‘ggplot2’ was built under R version 4.3.2”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
 dplyr     1.1.3      readr     2.1.4
 forcats   1.0.0      stringr   1.5.0
 ggplot2   3.5.0      tibble    3.2.1
 lubridate 1.9.2      tidyr     1.3.0
 purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

 broom        1.0.5      rsample      1.2.0
 dials        1.2.0      tune         1.1.2
 infer        1.0.5      workflows    1.1.3
 modeldata    1.2.0      workflowsets 1.0.1
 parsnip      1.1.1      yardstick    1.2.0
 recipes      1.0.8     

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
 scales::discard() masks purrr::discard()
 dplyr::filter()   masks stats::filter()
 recipes::fixed()  masks stringr::fixed()
 dplyr::lag()      masks stats::lag()
 yardstick::spec() masks readr::spec()
 recipes::step()   masks stats::step()
 Learn how to get started at https://www.tidymodels.org/start/

Attaching package: ‘tidyclust’


The following objects are masked from ‘package:parsnip’:

    knit_engine_docs, list_md_problems


Warning message:
“package ‘GGally’ was built under R version 4.3.2”
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Attaching package: ‘testthat’


The following object is masked from ‘package:rsample’:

    matches


The following object is masked from ‘package:dplyr’:

    matches


The following object is masked from ‘package:purrr’:

    is_null


The following objects are masked from ‘package:readr’:

    edition_get, local_edition


The following object is masked from ‘package:tidyr’:

    matches


Attaching package: ‘rlang’


The following objects are masked from ‘package:testthat’:

    is_false, is_null, is_true


The following objects are masked from ‘package:purrr’:

    %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
    flatten_raw, invoke, splice


1. Pokemon

We will be working with the Pokemon dataset from Kaggle, which can be found here. This dataset compiles statistics on 721 Pokemon, including each Pokemon's name, type, health points, attack strength, defensive strength, speed points, and more. These values describe a Pokemon's abilities (higher values are better). We are interested in whether there are any sub-groups/clusters of Pokemon based on these statistics, and if so, how many sub-groups/clusters there are.

Source: https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif

Question 1.0
{points: 1}

Use read_csv to load pokemon.csv from the data/ folder.

Assign your answer to an object called pokemon_full.

In [2]:
Student's answer
# your code here
#fail() # No Answer - remove if you provide an answer
pokemon_full <- read_csv("data/pokemon.csv")
Rows: 800 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Name, Type 1, Type 2
dbl (9): #, Total, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, Generation
lgl (1): Legendary
 Use `spec()` to retrieve the full column specification for this data.
 Specify the column types or set `show_col_types = FALSE` to quiet this message.
In [3]:
Grade cell: cell-56bc2b189beffa61 Score: 1.0 / 1.0
test_1.0()
Test passed 🎊
Test passed 🎉
Test passed 😀
Test passed 😸
[1] "Success!"

Question 1.1
{points: 1}

To start exploring the Pokemon data, create a scatter plot matrix (or pairplot) using ggpairs. The plot should only contain the columns Total through Speed from pokemon_full. You can check the data wrangling chapter in the textbook to recall how to select a range of columns using select with :.

Assign your answer to an object called pokemon_pairs. Make sure to set a suitable size for the plot.

In [4]:
Student's answer
options(repr.plot.height = 12, repr.plot.width = 12)
#
pokemon_pairs <- pokemon_full |> select(Total:Speed) |>
      ggpairs(aes(alpha = 0.05)) +
      theme(text = element_text(size = 20))

# your code here
#fail() # No Answer - remove if you provide an answer
pokemon_pairs
[Plot: ggpairs scatter plot matrix of the columns Total through Speed]
In [5]:
Grade cell: cell-4a0f59f0f380baad Score: 1.0 / 1.0
test_1.1()
Test passed 🌈
Test passed 😸
Test passed 😸
Test passed 🎉
[1] "Success!"

Question 1.2
{points: 1}

From the pairplot above, it does not look like the Pokemon are separated into clear groups in any of the pairwise variable scatterplots. Here, we will continue exploring the relationship between Speed and Defense and see what happens if we try to cluster the data points on these two variables, even though there are no visually discernible clusters in the chart.

First, select the columns Speed and Defense, creating a new dataframe with only those columns.

Assign your answer to an object named pokemon.

In [6]:
Student's answer
# your code here
#fail() # No Answer - remove if you provide an answer
pokemon <- select(pokemon_full, "Speed", "Defense")
pokemon
A tibble: 800 × 2
   Speed Defense
   <dbl>   <dbl>
      45      49
      60      63
      80      83
       ⋮       ⋮
      70      60
      80      60
      70     120
In [7]:
Grade cell: cell-b62427bf6857eae0 Score: 1.0 / 1.0
test_1.2()
Test passed 🎉
Test passed 😀
[1] "Success!"

Question 1.3
{points: 1}

Next, create a scatter plot of only these two variables so that we can look more closely at their relationship. Put the Speed variable on the x-axis, and the Defense variable on the y-axis.

Assign your plot to an object called pokemon_scatter. Don't forget to do everything needed to make an effective visualization, including setting an appropriate alpha value for the points.

In [8]:
Student's answer
# your code here
#fail() # No Answer - remove if you provide an answer
options(repr.plot.height = 7, repr.plot.width =7 )

pokemon_scatter <- pokemon|> 
                ggplot (aes (x = Speed, y = Defense))+
                geom_point(alpha = 0.3478956)+
                labs(x = "Speed (Base modifier)", y = "Base Damage Resistance")

pokemon_scatter
[Plot: scatter plot of Speed vs Defense]
In [9]:
Grade cell: cell-1b9d09e256e4b281 Score: 1.0 / 1.0
test_1.3()
Test passed 🌈
Test passed 😸
Test passed 🌈
Test passed 😸
Test passed 🥇
[1] "Success!"

Question 1.4.1
{points: 3}

The chart above confirms what we saw in the pairplot; there do not seem to be visually distinct clusters of points in these two dimensions. Could it still be informative to run clustering with this data? Let's find out by using K-means to cluster the Pokemon based on their Speed and Defense.

So far when using K-means, we have scaled our input features. Will it matter much for our clustering if we scale our variables for the Pokemon data? Is there any argument against scaling here?

Student's answer Score: 1.5 / 3.0

I think it could still be informative to run clustering on this data. For example, we could check whether certain Pokemon types (e.g., water types) tend toward particular combinations of these attributes.

Scaling usually matters for K-means: it puts every variable on a comparable scale, so no single variable dominates the distance calculation. Here, though, Speed and Defense are both base stats with similar ranges, so scaling should matter less than usual.

One argument against scaling is interpretability: after standardizing, the values are in standard-deviation units rather than the original stat values, so the cluster centres are harder to read directly (e.g., you can no longer see the actual Defense value of a centre).
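The claim that scaling matters less here can be checked directly. A minimal sketch (assuming the `pokemon` data frame from Question 1.2 is loaded; not part of the graded answer):

```r
# Sketch: compare the spread of the two variables. If their standard
# deviations are similar, standardizing changes the pairwise distances
# relatively little.
library(dplyr)
pokemon |>
    summarise(sd_speed = sd(Speed), sd_defense = sd(Defense))
```

If the two standard deviations were very different, unscaled distances would be dominated by the higher-variance variable, and scaling would be essential.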

Question 1.4.2
{points: 1}

Now, let's use K-means to cluster the Pokemon based on their Speed and Defense variables.

  • Create a recipe named pokemon_recipe that standardizes the data
  • Create a model specification named pokemon_spec for K-means clustering with 4 clusters.
  • Fit the model using a tidymodels workflow; call the output of the fit() function pokemon_clustering.

Assign your answers to objects called pokemon_recipe, pokemon_spec, and pokemon_clustering.

Note: We set the random seed here because K-means initializes observations to random clusters.

In [10]:
Student's answer
#DON'T CHANGE THE SEED VALUE BELOW!
set.seed(2019)

pokemon_recipe <- recipe( ~ . , pokemon) |> 
        step_scale(all_predictors()) |>
        step_center(all_predictors())

pokemon_spec <- k_means(num_clusters = 4) |>
         set_engine("stats")

pokemon_clustering <- workflow() |>
     add_recipe(pokemon_recipe) |>
     add_model(pokemon_spec) |>
     fit(data = pokemon)

# your code here
#fail() # No Answer - remove if you provide an answer
pokemon_clustering
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: k_means()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_scale()
• step_center()

── Model ───────────────────────────────────────────────────────────────────────
K-means clustering with 4 clusters of sizes 345, 139, 202, 114

Cluster means:
       Speed    Defense
4 -0.6081010 -0.6445867
3  0.6296539  0.9144968
1  1.1566773 -0.3181717
2 -0.9769812  1.3994566

Clustering vector:
  [1] 1 1 2 2 1 3 3 2 3 1 1 2 2 1 1 1 1 1 1 3 1 1 3 3 1 3 1 3 1 3 3 3 4 2 1 1 2
 [38] 1 1 3 1 1 1 3 1 1 1 3 1 1 1 1 1 1 3 3 3 3 3 1 3 1 3 1 3 3 3 2 3 3 3 3 1 1
 [75] 1 1 1 1 1 3 4 4 4 3 3 1 4 4 1 2 1 1 3 1 2 1 1 4 4 1 3 3 3 4 1 1 4 2 3 3 1
[112] 1 4 4 3 2 1 4 4 4 4 1 4 3 2 1 2 1 1 3 3 3 3 3 3 3 2 2 2 3 2 2 1 1 1 1 3 1
[149] 1 4 4 4 2 3 3 1 2 3 2 1 1 2 3 2 3 2 1 1 2 1 3 3 1 1 2 1 3 1 1 1 3 1 1 3 1
[186] 1 1 1 1 1 4 1 3 1 1 1 4 4 1 1 4 1 1 3 3 3 1 1 3 1 4 3 2 3 1 3 1 1 3 4 4 1
[223] 2 4 4 1 1 3 2 2 4 3 2 3 1 1 1 4 1 1 4 1 1 1 1 4 1 3 3 2 1 4 2 3 1 1 2 1 3
[260] 3 2 1 3 3 2 1 1 4 4 2 2 2 1 3 3 3 1 1 3 3 1 1 2 2 1 1 1 3 1 1 1 1 1 1 1 1
[297] 1 1 3 3 3 3 2 1 1 3 3 1 1 1 2 1 3 2 4 3 1 1 1 1 1 1 1 4 1 1 1 4 1 4 4 4 4
[334] 4 1 3 3 1 3 3 3 3 3 3 1 1 1 1 3 3 1 1 1 1 4 4 1 3 1 1 1 3 1 1 1 2 2 3 1 1
[371] 2 1 1 1 1 1 2 1 4 1 4 1 2 1 1 1 1 2 4 4 1 1 1 3 1 1 2 3 1 1 2 4 4 4 4 3 1
[408] 4 3 2 1 4 2 2 4 4 4 3 2 3 2 2 2 2 2 2 2 2 3 3 2 3 1 4 4 1 3 3 1 1 2 1 3 3
[445] 1 1 1 1 1 1 2 1 3 1 1 4 4 1 4 4 4 1 1 4 3 3 3 1 3 1 1 3 1 3 3 3 3 3 1 3 3
[482] 1 1 3 4 4 4 1 1 3 4 1 3 2 2 1 1 3 3 1 4 2 2 1 3 1 1 3 1 1 1 4 3 4 4 4 4 3
[519] 3 2 2 2 2 2 2 3 3 2 4 4 3 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 3 2 3 2 2 1
[556] 3 2 1 1 1 1 1 2 1 3 1 1 2 1 3 1 3 1 3 1 3 1 4 1 1 3 1 3 4 4 4 1 3 1 3 1 4
[593] 1 4 4 1 1 2 1 3 1 4 3 1 4 3 1 3 1 3 3 1 1 3 1 3 4 1 1 4 1 4 3 4 4 4 4 1 3
[630] 1 2 1 3 1 3 1 1 2 1 1 1 1 3 1 1 2 1 3 3 1 4 1 1 1 1 1 1 3 4 4 1 4 2 1 1 1
[667] 1 1 1 1 2 1 1 2 1 1 3 4 3 4 1 3 4 1 1 1 2 4 1 3 1 2 1 2 1 1 2 1 3 2 2 3 3
[704] 3 3 3 2 2 2 2 2 2 2 2 2 3 3 2 1 4 4 1 1 3 1 3 3 1 2 1 3 3 1 1 3 1 3 1 1 3
[741] 1 1 1 1 3 1 3 3 4 4 1 4 1 1 1 2 1 2 1 2 1 4 1 2 1 3 1 2 1 1 1 3 3 4 1 1 3
[778] 2 1 1 1 1 1 1 2 2 2 4 4 4 1 3 2 2 2 4 2 1 3 2

Within cluster sum of squares by cluster:
[1] 187.20769  74.66872 123.33522 147.72827
 (between_SS / total_SS =  66.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
In [11]:
Grade cell: cell-b899d544a0fec2ea Score: 1.0 / 1.0
test_1.4.2()
Test passed 🎉
Test passed 😀
Test passed 😸
Test passed 🎊
Test passed 🎉
Test passed 🎉
Test passed 🎊
Test passed 😸
Test passed 🌈
Test passed 🎊
Test passed 🥇
Test passed 😸
[1] "Success!"

Question 1.5
{points: 1}

Let's visualize the clusters we built in pokemon_clustering. Use the augment function and create a coloured scatter plot of Speed (x-axis) vs Defense (y-axis) with the points coloured by their cluster assignment.

Name this plot pokemon_clustering_plot.

In [12]:
Student's answer
# your code here
#fail() # No Answer - remove if you provide an answer

aug <- augment(pokemon_clustering ,pokemon)

pokemon_clustering_plot <- aug|>
        ggplot (aes (x = Speed, y = Defense, colour = .pred_cluster))+
        geom_point(alpha = 0.62)+
        labs(x = "Speed (Base modifier)", y = "Base Damage Resistance", colour = "Cluster Type")

pokemon_clustering_plot
[Plot: Speed vs Defense coloured by cluster assignment (K = 4)]
In [13]:
Grade cell: cell-2d872f629c06cf4d Score: 1.0 / 1.0
test_1.5()
Test passed 🌈
Test passed 🥇
Test passed 🎊
Test passed 🎊
Test passed 😀
[1] "Success!"

Question 1.6
{points: 3}

Below you can see multiple initializations of k-means with different seeds for K = 4. Can you explain what is happening and how we can mitigate this in the kmeans function?

Student's answer Score: 3.0 / 3.0

When we use different values in set.seed, we get different streams of randomness. K-means starts from a random initialization (random initial cluster assignments), so different seeds can produce different final clusterings; some initializations even converge to worse local optima, which is why the plots above differ.

To mitigate this, we can pass the nstart argument to the kmeans function (e.g., nstart = 10). This runs the algorithm from multiple random initializations and keeps the best result (the one with the lowest total WSSD), which makes the output much more consistent across seeds.
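The "stats" engine passes nstart straight through to stats::kmeans(), so the effect of multiple starts can be sketched in base R (synthetic data; a sketch, not part of the graded answer):

```r
# Sketch: k-means with one random start vs. the best of 25 random starts.
# With nstart = 25, kmeans() keeps the run with the lowest total WSSD,
# so the result is typically much less sensitive to the seed.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
fit_one  <- kmeans(x, centers = 4, nstart = 1)
fit_many <- kmeans(x, centers = 4, nstart = 25)
fit_one$tot.withinss
fit_many$tot.withinss  # typically no worse than the single-start fit
```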

Question 1.7
{points: 1}

We know that comparing how the WSSD varies for multiple values of $K$ is an important step of selecting a suitable clustering model. That's what we will do next!

For this exercise, you will calculate the total within-cluster sum-of-squared distances for $K$ = 1 to $K$ = 10.

  1. Create a tibble with the desired values of $K$.
  2. Create a new model specification that sets nstart to 10 and tells k_means you want to tune the number of clusters.
  3. Create a new workflow that uses tune_cluster to tune the number of clusters
  4. Use the collect_metrics function to collect the results.
  5. Use filter, select, and mutate functions to construct a tibble with two columns named num_clusters and total_WSSD. Store that tibble in an object named elbow_stats.

Assign your answer to a tibble object named elbow_stats. It should have the columns num_clusters and total_WSSD.

In [14]:
Student's answer
set.seed(2020) # DO NOT REMOVE

# your code here
#fail() # No Answer - remove if you provide an answer


k_vals <- tibble(num_clusters = 1:10)
spec_tune <- k_means(num_clusters = tune()) |>
         set_engine("stats", nstart = 10)

kmeans_tuning_stats <-workflow()|>
        add_recipe(pokemon_recipe) |>
        add_model(spec_tune)|>
        tune_cluster(resamples = apparent(pokemon), grid = k_vals) |>
        collect_metrics()

elbow_stats <- kmeans_tuning_stats |>
         mutate(total_WSSD = mean) |>
         filter(.metric == "sse_within_total") |>
         select(num_clusters, total_WSSD)

elbow_stats
A tibble: 10 × 2
 num_clusters total_WSSD
        <int>      <dbl>
            1  1598.0000
            2  1031.4596
            3   640.4274
            ⋮          ⋮
            8   267.6408
            9   235.6118
           10   214.2753
In [15]:
Grade cell: cell-4431d9afb3d2822e Score: 1.0 / 1.0
test_1.7()
Test passed 🎉
Test passed 😀
Test passed 🥇
Test passed 🌈
Test passed 🎉
Test passed 😀
[1] "Success!"

Question 1.8
{points: 1}

Let's visualize how the WSSD changes as we vary the value of $K$. To do this, create the elbow plot. Put the total within-cluster sum of squares on the y-axis, and the number of clusters on the x-axis.

Assign your plot to an object called elbow_plot.

In [16]:
Student's answer
# your code here
#fail() # No Answer - remove if you provide an answer
elbow_plot <- elbow_stats|> 
                ggplot (aes (x = num_clusters, y = total_WSSD))+
                geom_line()+
                geom_point()+
                xlab("K value")+
                ylab("Total Within Cluster \n Sum of Squares")+
                scale_x_continuous(breaks = seq(0, 10, by = 1))

elbow_plot
[Plot: elbow plot of total within-cluster sum of squares vs number of clusters]
In [17]:
Grade cell: cell-9be55abc471b0f7f Score: 1.0 / 1.0
test_1.8()
Test passed 🎊
Test passed 😀
Test passed 🎉
Test passed 🎉
Test passed 🎉
[1] "Success!"

Question 1.9
{points: 3}

Based on the elbow plot above, what value of $K$ would you choose? Explain why.

Student's answer Score: 3.0 / 3.0

I would pick K = 3.

K = 3 is chosen because when we increase K further, the WSSD decreases by an insignificant amount (we are past the "elbow", so the returns are diminishing). A K value that is too high can also cause problems, such as over-clustering (splitting what is really one group into several clusters).

If we chose K = 1 or 2, too many unrelated points would be grouped together, which would make the clusters much less meaningful.

Question 1.10
{points: 3}

Using the value that you chose for $K$, perform the K-means algorithm, set nstart = 10 and assign your answer to an object called pokemon_final_kmeans.

Augment the data with the final cluster labels and assign your answer to an object called pokemon_final_clusters.

Finally, create a plot called pokemon_final_clusters_plot to visualize the clusters. Include a title, colour the points by the cluster and make sure your axes are human-readable.

In [18]:
Student's answer Score: 2.0 / 3.0
set.seed(2019) # DO NOT REMOVE
# your code here
#fail() # No Answer - remove if you provide an answer

pokemon_final_kmeans <- k_means(num_clusters = 3) |>
         set_engine("stats", nstart = 10)

pkm_kmeans <- workflow()|>
                add_recipe(pokemon_recipe)|>
                add_model(pokemon_final_kmeans)|>
                fit(data = pokemon)

pokemon_final_clusters <- augment(pkm_kmeans, pokemon)

pokemon_final_clusters_plot <- pokemon_final_clusters|>
        ggplot (aes (x = Speed, y = Defense, colour = .pred_cluster))+
        geom_point(alpha = 0.62)+
        labs(x = "Speed (Base modifier)", y = "Base Damage Resistance", colour = "Cluster Type")

#pkm_kmeans
pokemon_final_clusters_plot
[Plot: Speed vs Defense coloured by cluster assignment (K = 3)]
In [19]:
Grade cell: cell-8c00c3f9bb72e130 Score: 0.0 / 0.0
test_1.10()
Test passed 🥇
Test passed 😀
Test passed 🌈
Test passed 😀
Test passed 🌈
[1] "Success!"

Question 1.11
{points: 3}

This looks perhaps a bit better than when we used $K=4$ clusters originally, but is it really a lot better? Use this plot and the elbow plot from Question 1.8 to reason about what might be going on here.

Student's answer Score: 3.0 / 3.0

I don't think this graph is much better than the K = 4 cluster graph.

Looking at the elbow plot from Question 1.8, past K = 3 we start to see diminishing returns in terms of the total WSSD. Higher K values therefore tend to over-subdivide the clusters: while the total WSSD is lower, we are creating more groups within the data than the scatter plot visually supports. That is consistent with what we see here: both the K = 3 and K = 4 plots simply carve one big cloud of points into adjacent coloured regions.

2. Tourism Reviews

Source: https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif

The Ministry of Land, Infrastructure, Transport and Tourism of Japan is interested in knowing what types of tourists visit East Asia. They know the majority of their visitors come from this region and would like to stay competitive in the region to keep growing the tourism industry. For this, they have hired us to perform a segmentation of the tourists. A dataset scraped from TripAdvisor is provided to you.

This dataset contains the following variables:

  • User ID : Unique user id
  • Category 1 : Average user feedback on art galleries
  • Category 2 : Average user feedback on dance clubs
  • Category 3 : Average user feedback on juice bars
  • Category 4 : Average user feedback on restaurants
  • Category 5 : Average user feedback on museums
  • Category 6 : Average user feedback on resorts
  • Category 7 : Average user feedback on parks/picnic spots
  • Category 8 : Average user feedback on beaches
  • Category 9 : Average user feedback on theaters
  • Category 10 : Average user feedback on religious institutions

Question 2.0
{points: 3}

Load the data set from https://archive.ics.uci.edu/ml/machine-learning-databases/00484/tripadvisor_review.csv and clean it so that only the Category # columns are in the data frame (i.e., remove the User ID column).

Assign your answer to an object called clean_reviews.

In [20]:
Student's answer
# your code here
#fail() # No Answer - remove if you provide an answer
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00484/tripadvisor_review.csv", "data/data.csv")
clean_reviews <- read_csv("data/data.csv")|>
                select(-"User ID")


clean_reviews
Rows: 980 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): User ID
dbl (10): Category 1, Category 2, Category 3, Category 4, Category 5, Catego...
 Use `spec()` to retrieve the full column specification for this data.
 Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 980 × 10
 `Category 1` `Category 2` `Category 3` `Category 4` `Category 5` `Category 6` `Category 7` `Category 8` `Category 9` `Category 10`
        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>         <dbl>
         0.93         1.8          2.29         0.62         0.80         2.42         3.19         2.79         1.82          2.42
         1.02         2.2          2.66         0.64         1.42         3.18         3.21         2.63         1.86          2.32
         1.22         0.8          0.54         0.53         0.24         1.54         3.18         2.80         1.31          2.50
            ⋮            ⋮            ⋮            ⋮            ⋮            ⋮            ⋮            ⋮            ⋮             ⋮
         0.61         1.32         0.67         0.43         1.30         1.78         3.17         2.81         1.34          3.02
         0.93         0.20         0.13         0.43         0.30         0.40         3.18         2.98         1.12          2.46
         0.93         0.56         1.13         0.51         1.34         2.36         3.18         2.87         1.34          2.40
In [21]:
Grade cell: cell-6050fab33fa36da5 Score: 3.0 / 3.0
test_that('Did not create an object called clean_reviews', {
    expect_true(exists("clean_reviews"))
})
# The remainder of the tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.

### BEGIN HIDDEN TESTS
test_that('clean_reviews is not a tibble', {
    expect_true('tbl' %in% class(clean_reviews))
})
test_that('clean_reviews does not contain the correct data', {
    expect_equal(dim(clean_reviews), c(980, 10))
    expect_false('User ID' %in% colnames(clean_reviews))
    expect_equal(digest(int_round(sum(clean_reviews$`Category 1`), 2)), '2885dff545169dfd21c255f972530c43')
    expect_equal(digest(int_round(sum(clean_reviews$`Category 10`), 2)), '0a3dc4a87827d808b12aaf765f7823e2')
})
print("Success!")
### END HIDDEN TESTS
Test passed 🎊
Test passed 😀
Test passed 😸
[1] "Success!"

Question 2.1
{points: 3}

Perform K-means and vary $K$ from 1 to 10 to identify the optimal number of clusters. Use nstart = 100. Assign your answer to a tibble object called tourism_elbow_stats that has the columns num_clusters and total_WSSD.

Afterwards, create an elbow plot to help you choose $K$. Assign your answer to an object called tourism_elbow_plot.

In [22]:
Student's answer
#DON'T CHANGE THIS SEED VALUE
set.seed(2019)


tourism_recipe <- recipe( ~ . , clean_reviews) |> 
        step_scale(all_predictors()) |>
        step_center(all_predictors())


k_vals <- tibble(num_clusters = 1:10)
tourism_spec_tune <- k_means(num_clusters = tune()) |>
         set_engine("stats", nstart = 100)

tourism_wkflow <-workflow()|>
        add_recipe(tourism_recipe) |>
        add_model(tourism_spec_tune)|>
        tune_cluster(resamples = apparent(clean_reviews), grid = k_vals) |>
        collect_metrics()

tourism_elbow_stats <- tourism_wkflow |>
         mutate(total_WSSD = mean) |>
         filter(.metric == "sse_within_total") |>
         select(num_clusters, total_WSSD)

tourism_elbow_stats


tourism_elbow_plot <- tourism_elbow_stats|> 
                ggplot (aes (x = num_clusters, y = total_WSSD))+
                geom_line()+
                geom_point()+
                xlab("K value")+
                ylab("Total Within Cluster \n Sum of Squares")+
                scale_x_continuous(breaks = seq(0, 10, by = 1))

tourism_elbow_plot






# your code here
#fail() # No Answer - remove if you provide an answer
! Apparent: preprocessor 1/1, model 10/10: did not converge in 10 iterations

A tibble: 10 × 2
 num_clusters total_WSSD
        <int>      <dbl>
            1   9790.000
            2   7759.473
            3   7090.652
            ⋮          ⋮
            8   5055.061
            9   4841.085
           10   4678.027
[Plot: elbow plot of total within-cluster sum of squares vs number of clusters]
In [23]:
Grade cell: cell-6606935f013bdc26 Score: 3.0 / 3.0
test_that('Did not create an object called elbow_stats', {
    expect_true(exists('elbow_stats'))
})
test_that('Did not create a plot called tourism_elbow_plot', {
    expect_true(exists('tourism_elbow_plot'))
})
# The remainder of the tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
### BEGIN HIDDEN TESTS
test_that('elbow_stats is not a tibble', {
    expect_true('tbl' %in% class(elbow_stats))
})
test_that('elbow_stats does not contain the correct data', {
    expect_equal(dim(elbow_stats), c(10, 2))
    expect_equal(digest(int_round(sum(elbow_stats$num_clusters), 2)), 'ae80db6ce6693e364a09f88dd4e3d553')
    expect_equal(digest(int_round(sum(elbow_stats$total_WSSD), 2)), 'ec4cf6680f8fc2d23a8982f80c85093f')
})
test_that('nstart is not 100', {
    expect_equal(digest(sum(elbow_plot$iter)), '1473d70e5646a26de3c52aa1abd85b1f')
})
properties <- c(tourism_elbow_plot$layers[[1]]$mapping, tourism_elbow_plot$mapping)
properties2 <- c(tourism_elbow_plot$layers[[2]]$mapping, tourism_elbow_plot$mapping)
test_that('tourism_elbow_plot should be a line plot with points', {
    expect_true("GeomPoint" %in% c(class(tourism_elbow_plot$layers[[1]]$geom), class(tourism_elbow_plot$layers[[2]]$geom)))
    expect_true("GeomLine" %in% c(class(tourism_elbow_plot$layers[[1]]$geom), class(tourism_elbow_plot$layers[[2]]$geom)))
})
test_that('num_clusters should be on the x-axis', {
    expect_true(rlang::get_expr(properties$x) == 'num_clusters')
})
test_that('total_WSSD should be on the y-axis', {
    expect_true(rlang::get_expr(properties$y) == 'total_WSSD')
})
test_that('Labels on the axes should be descriptive and human readable.', {
        expect_false((tourism_elbow_plot$labels$y) == 'total_WSSD')
})
print('Success!')
### END HIDDEN TESTS
Test passed 🎊
Test passed 🥳
Test passed 🎊
Test passed 😸
Test passed 😸
Test passed 🎉
Test passed 🎉
Test passed 🎊
Test passed 🌈
[1] "Success!"

Question 2.2
{points: 3}

From the elbow plot above, which $k$ should you choose? Explain why you chose that $k$.

Student's answer Score: 3.0 / 3.0

I chose a K value of 2 because the WSSD decreases the most when we go from K = 1 to K = 2; for any larger K, no step has nearly as big a drop.

As mentioned earlier, a K value that is too high leads to the data being over-subdivided into a large number of clusters. A K value of 2 still gives appropriate clustering, without the groups becoming either too fragmented or too general.

This is why I chose K = 2.

Question 2.3
{points: 3}

Run K-means again, with the optimal $K$, and assign your answer to an object called reviews_clusters. Use nstart = 100. Then, use the augment function to get the cluster assignments for each point. Name the data frame cluster_assignments.

In [24]:
Student's answer Score: 3.0 / 3.0
#DONT CHANGE THIS SEED VALUE
set.seed(2019)

# your code here
#fail() # No Answer - remove if you provide an answer
tourism_final_kmeans <- k_means(num_clusters = 2) |>
         set_engine("stats", nstart = 100)

tourism_fits <- workflow()|>
                add_recipe(tourism_recipe)|>
                add_model(tourism_final_kmeans)|>
                fit(data = clean_reviews)

cluster_assignments <- augment(tourism_fits , clean_reviews)

cluster_assignments
A tibble: 980 × 11
 `Category 1` `Category 2` `Category 3` `Category 4` `Category 5` `Category 6` `Category 7` `Category 8` `Category 9` `Category 10` .pred_cluster
        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>         <dbl> <fct>
         0.93         1.8          2.29         0.62         0.80         2.42         3.19         2.79         1.82          2.42 Cluster_1
         1.02         2.2          2.66         0.64         1.42         3.18         3.21         2.63         1.86          2.32 Cluster_1
         1.22         0.8          0.54         0.53         0.24         1.54         3.18         2.80         1.31          2.50 Cluster_2
            ⋮            ⋮            ⋮            ⋮            ⋮            ⋮            ⋮            ⋮            ⋮             ⋮ ⋮
         0.61         1.32         0.67         0.43         1.30         1.78         3.17         2.81         1.34          3.02 Cluster_2
         0.93         0.20         0.13         0.43         0.30         0.40         3.18         2.98         1.12          2.46 Cluster_2
         0.93         0.56         1.13         0.51         1.34         2.36         3.18         2.87         1.34          2.40 Cluster_1

For the following two questions, use the plot below as a reference.

The visualization below is a density plot; you can think of it as a smoothed version of a histogram. Density plots are more effective than histograms for comparing multiple distributions. What we are looking for in these visualizations is which variables have different distributions between the clusters.

In [25]:
options(repr.plot.height = 8, repr.plot.width = 15)
cluster_assignments |>
    pivot_longer(cols = -.pred_cluster, names_to = 'category', values_to = 'value')  |> 
    ggplot(aes(value, fill = .pred_cluster)) +
        geom_density(alpha = 0.4, colour = 'white') +
        # We are setting the x-scale to "free" since we standardized the rating values before clustering them,
        # which means that their original range (which is what we show here) does not matter
        facet_wrap(facets = vars(category), scales = 'free') +
        theme_minimal() +
        theme(text = element_text(size = 20))
[Plot: density plots of each rating category, faceted by category and filled by cluster assignment]

Question 2.4 Multiple Choice:
{points: 1}

From the plots above, which categories might we hypothesize are driving the clustering (i.e., are useful for distinguishing between the types of tourists)? The categories are listed again below.

  • Category 1 : Average user feedback on art galleries
  • Category 2 : Average user feedback on dance clubs
  • Category 3 : Average user feedback on juice bars
  • Category 4 : Average user feedback on restaurants
  • Category 5 : Average user feedback on museums
  • Category 6 : Average user feedback on resorts
  • Category 7 : Average user feedback on parks/picnic spots
  • Category 8 : Average user feedback on beaches
  • Category 9 : Average user feedback on theaters
  • Category 10 : Average user feedback on religious institutions

A. 10, 3, 5, 6, 7

B. 10, 3, 5, 6, 1

C. 10, 3, 4, 6, 7

D. 10, 2, 5, 6, 7

Assign your answer to an object called answer2.4. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [26]:
Student's answer
# your code here
#fail() # No Answer - remove if you provide an answer
answer2.4 <- "A"
In [27]:
Grade cell: cell-96590ae632b0f9c5 Score: 1.0 / 1.0
### BEGIN HIDDEN TESTS
test_that('Did not create an object called answer2.4', {
    expect_true(exists('answer2.4'))
    })

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.4), '75f1160e72554f4270c809f041c7a776')
    })
print("Success!")
### END HIDDEN TESTS
Test passed 😀
Test passed 🎊
[1] "Success!"

Question 2.5
{points: 3}

Discuss one disadvantage of not being able to visualize the clusters when dealing with multidimensional data.

Student's answer Score: 3.0 / 3.0

The biggest issue with not being able to visualize multidimensional data (and its clusters) is that it is hard to see which variables contribute most to the clustering, and therefore how the points end up grouped. As humans, one of the best ways we understand data is through graphs; without visuals, the result is just a table of numbers and labels, which makes the clusters much harder to interpret and communicate.
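One partial remedy, sketched below (not part of the graded answer): even without a plot, the fitted centroids can be inspected numerically. tidyclust provides an extract_centroids() function, which I believe can be applied to the fitted workflow tourism_fits from Question 2.3:

```r
# Sketch: one row per cluster (in the preprocessed, standardized units),
# showing which categories differ most between the two clusters.
# Assumes `tourism_fits`, the fitted workflow from Question 2.3, exists.
library(tidyclust)
extract_centroids(tourism_fits)
```

Categories whose centroid values differ substantially between the clusters are the ones driving the segmentation, which is the same question the density plots answer visually.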

In [28]:
source("cleanup.R")
In [ ]: